Bikesharing Dataset

Andrea Di Simone


Setting up the environment

Some relevant imports are executed here. The plot style is also chosen, and interactive plotting is selected.

to top

In [1]:
import numpy as np
import pandas as pd
import datetime as dt

from sklearn import model_selection
from sklearn import feature_selection
from sklearn import preprocessing
from sklearn import metrics
from sklearn import ensemble

from sklearn import neighbors
from sklearn import linear_model
from sklearn import naive_bayes
from sklearn import neural_network
from sklearn import svm

from sklearn.decomposition import PCA


from collections import defaultdict


# table formatting
from IPython.display import display

import matplotlib.pyplot as plt
plt.style.use('ggplot')

# set interactive plotting on
plt.ion()

plt.rc('text', usetex=True)

Steering the execution

In the following cell, one can choose which regression algorithms need to be trained and evaluated. The models dictionary maps a list of names (free text) to the model implementation in sklearn and its arguments. The input csv file is defined in the input_file variable, and max_percentage is the maximum fraction of entries to be used for training.

to top

In [23]:
models={
    
    'KNeighborsRegressor':[neighbors.KNeighborsRegressor,{'weights':'distance','n_neighbors':5}],
    'Linear Regression':[linear_model.LinearRegression,{'copy_X':True}],
    'BDT':[ensemble.RandomForestRegressor,{'n_estimators':100}],
    'Ridge':[linear_model.Ridge,{'copy_X':True}],
#    'Ridge2':[linear_model.Ridge,{'copy_X':True,'alpha':2.0}],
    'Lasso':[linear_model.Lasso,{}],
#    'Lasso2':[linear_model.Lasso,{'alpha':2.0}],
    'MLPRegressor':[neural_network.MLPRegressor,{}], 
#    'MLPRegressor2':[neural_network.MLPRegressor,{'hidden_layer_sizes': (100,50)}], 

    'SVR':[svm.SVR,{}]
}

input_file = "hour.csv"

max_percentage=30

Reading/inspecting the inputs

The following code reads the csv file in a DataFrame and plots the variables, both as a function of time and as histograms.

to top

In [3]:
# default delimiter
df = pd.read_csv(input_file, header = 0)

# overview of original inputs

df[['season', 'yr', 'mnth', 'hr']].plot(subplots=True, layout=(2,2),title="\huge{Original Inputs}", figsize=(10,10))
df[['season', 'yr', 'mnth', 'hr']].plot.hist(subplots=True, layout=(2,2),title="\huge{Original Inputs, hist}", figsize=(10,10))
df[[ 'holiday','weekday', 'workingday', 'weathersit']].plot(subplots=True,layout=(2,2),title="\huge{Original Inputs}", figsize=(10,10))
df[[ 'holiday','weekday', 'workingday', 'weathersit']].plot.hist(subplots=True,layout=(2,2),title="\huge{Original Inputs, hist}", figsize=(10,10))
df[[ 'temp', 'atemp', 'hum','windspeed']].plot(subplots=True,layout=(2,2),title="\huge{Original Inputs}", figsize=(10,10))
df[[ 'temp', 'atemp', 'hum','windspeed']].plot.hist(subplots=True,layout=(2,2),title="\huge{Original Inputs, hist}", figsize=(10,10))
df[[ 'casual', 'registered', 'cnt']].plot(subplots=True,layout=(3,1),title="\huge{Original Inputs}", figsize=(10,10))
df[[ 'casual', 'registered', 'cnt']].plot.hist(subplots=True,layout=(3,1),title="\huge{Original Inputs, hist}", figsize=(10,10))
Out[3]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f064ba79c90>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f064bac5090>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f064bc1ee10>]], dtype=object)

Preparation of data

This cell manipulates the inputs and prepares them for processing. In particular, the following steps are performed:

  • Convert the dteday column in a numeric format. Since year and month are already stored in the dataset as separate column, I choose to keep only the day of the month.
  • There are three possible target variables, representing the total number of users, the number of registered users and the number of casual users. I want to study the predictions for the three variable separately, so they are extracted from the dataset and stored as separate variables
  • The features need to be extracted from the dataset. I choose to ignore instant, since it is redundant, once we know date and time. The day of the month (i.e. dteday) seems to me to be irrelevant for this particular problem: season, weather, and weekends/holidays are recorded in separate fields in the same dataset.
  • I convert some of the features into categoricals, using get_dummy. This, together with interaction terms, may add some capacity to linear models
  • I add 2nd degree interactions
  • in the case of high dimensionality (i.e. with categoricals+interactions) I experiment with some feature selection: PCA, model-based
  • In principle, the shapes of all the variables look good. The targets are, however, peaked at zero, which may pose some problems. I choose to transform them to log+1. This will reduce the range, as well as giving a more regular shape.
  • Once the features have been defined, I choose to scale them, so one can also test algorithms that are not scale-invariant
In [8]:
from sklearn.preprocessing import PolynomialFeatures

myX=defaultdict()
myX_scaled=defaultdict()

#this gives back the column names
original_headers = list(df.columns.values)

# tha date field has an annoying format, and month and year are redundant.
# let's turn it into an integer representing the day of the month

df['day']=df['dteday'].apply(lambda x: dt.datetime.strptime(str(x), "%Y-%m-%d").date().day)

# extract our target variables
myY=np.log(df[['cnt']]+1)
myYc=np.log(df[['casual']]+1)
myYr=np.log(df[['registered']]+1)


myY.plot.hist(title="\huge{Total users, log}", figsize=(10,10))
myYc.plot.hist(title="\huge{Casual users, log}", figsize=(10,10))
myYr.plot.hist(title="\huge{Registered users, log}", figsize=(10,10))


# extract the features. 'instant' has no real information, once we know date and time.

myX['plain']=df[[ column for column in list(df.columns.values) if column not in ['cnt' ,'dteday','instant','casual','registered'] ]]

# use OneHotEncoder for categorical variables. I prefer to use it column by column

myX['cat']=pd.get_dummies(myX['plain'],columns=['weekday','workingday','holiday','season','yr'])

# use polynomial features to add interactions

poly=PolynomialFeatures(degree=2,interaction_only=True,include_bias=False).fit(myX['cat'])
myX['int']=poly.transform(myX['cat'])
myX['int']=pd.DataFrame(myX['int'])
myX['int'].columns=list(myX['cat'].columns)+list(poly.get_feature_names()[len(myX['cat'].columns):])

# feature selection using PCA

myX['pca']=PCA(n_components=50,random_state=0).fit_transform(myX['int'].values)
myX['pca']=pd.DataFrame(myX['pca'])

print("Selected features using PCA")

# feature selection using selectpercentile

myX['perc']=feature_selection.SelectPercentile(
    score_func=feature_selection.f_regression,
    percentile=50
    ).fit_transform(myX['int'].values, myYr.values.ravel())
myX['perc']=pd.DataFrame(myX['perc'])

print("Selected features using SelectPercentile. Number of features is {}".format(myX['perc'].shape[1])")

# feature selection using a BDT

myX['bdt']=feature_selection.SelectFromModel(
    ensemble.RandomForestRegressor(n_estimators=100), threshold='median'
    ).fit_transform(myX['int'].values, myYr.values.ravel())
myX['bdt']=pd.DataFrame(myX['bdt'])


print("Selected features using BDT. Number of features is {}".format(myX['bdt'].shape[1]))
                                 
# scale inputs, for models that are not scale-invariant

scaler = preprocessing.MinMaxScaler()

# scale inputs and put them back in a DF

for input in myX.keys():

    myX_scaled[input]=myX[input].values
    myX_scaled[input]=scaler.fit_transform(myX_scaled[input])
    myX_scaled[input]=pd.DataFrame(myX_scaled[input])
    myX_scaled[input].columns=myX[input].columns

print("Scaled inputs. Full list of inputs is {}".format(myX_scales.keys()))
Selected features using PCA
Selected features using SelectPercentile
Selected features using BDT. Number of features is 163

Check data after manipulation

A few histograms to cross check that the inputs have not been inadvertently corrupted by the extraction/manipulation in the previous cells

to top

In [9]:
myX_scaled['plain'][['season', 'yr', 'mnth', 'hr']].plot.hist(subplots=True, layout=(2,2),title="\huge{Scaled Features, hist}",bins=24,figsize=(10,10))
myX_scaled['plain'][[ 'holiday','weekday', 'workingday', 'weathersit']].plot.hist(subplots=True,layout=(2,2),title="\huge{Scaled Features, hist}",bins=24,figsize=(10,10))
myX_scaled['plain'][[ 'temp', 'atemp', 'hum','windspeed']].plot.hist(subplots=True,layout=(2,2),title="\huge{Scaled Features, hist}",bins=24,figsize=(10,10))
Out[9]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f0649109850>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f064f218f90>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f064b92dcd0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f064bdf0a10>]], dtype=object)

Creating training and test sets

The code splits the inputs in different test/training pairs with varying number of events. The loop runs on all fractions from 0.01 to the maximum defined in max_percentage, in steps of 0.01. To save some programming time, the built-in model_selection.train_test_split is used for the splitting itself. At the end of this cell, there will be one train/test pair for each desired training fraction.

to top

In [10]:
fractions_test=[x*0.01 for x in range(1,max_percentage+1)]

myX_train=defaultdict(lambda: defaultdict() )
myX_test=defaultdict(lambda: defaultdict() )

#for input in myX.keys():
#    # init defaultdicts so we can use them in assignment
#    myX_train[input]=None
#    myX_test[input]=None

myY_train=dict()
myY_test=dict()
myYc_train=dict()
myYc_test=dict()
myYr_train=dict()
myYr_test=dict()

seed=0
# generate samples
for fraction in fractions_test:

    # prepare input tuple for train_test_split
    inputtuple=()    
    
    for indata in myX.keys():
        inputtuple=inputtuple+(myX[indata],)

    inputtuple=inputtuple+(myY, myYc, myYr)
            
    outputtuple = model_selection.train_test_split(*inputtuple,test_size=1-fraction, random_state=seed)
    
    # order in outputtuple is the same as order in myX.keys()
    
    for res in myX.keys():
        myX_train[res][fraction]=outputtuple[0]
        myX_test[res][fraction]=outputtuple[1]
        # pop values as they are extracted
        outputtuple=outputtuple[2:]
 
    # now assign features
    myY_train[fraction]=outputtuple[-6]
    myY_test[fraction]=outputtuple[-5]
    myYc_train[fraction]=outputtuple[-4]
    myYc_test[fraction]=outputtuple[-3]
    myYr_train[fraction]=outputtuple[-2]
    myYr_test[fraction]=outputtuple[-1]

Model fitting

The following code loops on the desired training fractions and on the configured models. For each fraction/model, three regressors are fitted: one for the total number of users, one for the number of registered users and one for the number of casual users.

After fitting the model, the mean absolute error is calculated (using the built-in metrics.mean_absolute_error). The scores defined for each algorithm are also evaluated.

Since the exercise requires to study in particular a training fraction of 10%, information corresponding to this step is saved for further inspection (the target_* variables).

The loop may take a few minutes, depending on how many models have been activated.

to top

In [24]:
error=defaultdict(lambda: defaultdict(list))
errorc=defaultdict(lambda: defaultdict(list))
errorr=defaultdict(lambda: defaultdict(list))
errorsum=defaultdict(lambda: defaultdict(list))

train_error=defaultdict(lambda: defaultdict(list))
train_errorc=defaultdict(lambda: defaultdict(list))
train_errorr=defaultdict(lambda: defaultdict(list))
train_errorsum=defaultdict(lambda: defaultdict(list))

score=defaultdict(lambda: defaultdict(list))
scorec=defaultdict(lambda: defaultdict(list))
scorer=defaultdict(lambda: defaultdict(list))

# save here the predictions for the target training fraction
target_prediction=defaultdict(lambda: defaultdict(list))
target_predictionr=defaultdict(lambda: defaultdict(list))
target_predictionc=defaultdict(lambda: defaultdict(list))

# save here the fitted models
target_model=defaultdict(lambda: defaultdict())
target_modelr=defaultdict(lambda: defaultdict())
target_modelc=defaultdict(lambda: defaultdict())


# full inputs
inputs=myX.keys()
# selected inputs
#inputs=['pca']

# setup a simple progress bar

import sys
total_steps=len(inputs)*len(fractions_test)*len(models)
done_steps=0

# now  loop

for input in inputs:

    for fraction in fractions_test:

        for name,model in models.iteritems():

            # skip low-stat training for high-dim configurations
            if input=='plain' or (input =='cat' and fraction>0.05) or (input in['int','pca', 'bdt','perc'] and fraction>=0.1):
                # use custom arguments
                themodel=model[0](** model[1])
                themodelc=model[0](** model[1])
                themodelr=model[0](** model[1])

                #print 'fitting a',name,'model with training fraction',fraction

                themodel.fit(myX_train[input][fraction], np.ravel(myY_train[fraction]))
                themodelc.fit(myX_train[input][fraction], np.ravel(myYc_train[fraction]))
                themodelr.fit(myX_train[input][fraction], np.ravel(myYr_train[fraction]))

                # cache results
                result= themodel.predict(myX_test[input][fraction])
                resultc=themodelc.predict(myX_test[input][fraction])
                resultr=themodelr.predict(myX_test[input][fraction])
                train_result= themodel.predict(myX_train[input][fraction])
                train_resultc=themodelc.predict(myX_train[input][fraction])
                train_resultr=themodelr.predict(myX_train[input][fraction])

                if fraction==0.1:
                    target_prediction[input][name]=result
                    target_predictionr[input][name]=resultr
                    target_predictionc[input][name]=resultc
                    target_model[input][name]= themodel
                    target_modelr[input][name]=themodelr
                    target_modelc[input][name]=themodelc

                error[input][name].append(metrics.mean_absolute_error(np.exp(result)-1,np.exp(np.ravel(myY_test[fraction]))-1))
                errorc[input][name].append(metrics.mean_absolute_error(np.exp(resultc)-1,np.exp(np.ravel(myYc_test[fraction]))-1))
                errorr[input][name].append(metrics.mean_absolute_error(np.exp(resultr)-1,np.exp(np.ravel(myYr_test[fraction]))-1))
                errorsum[input][name].append(metrics.mean_absolute_error(np.exp(resultc)+np.exp(resultr)-2, np.exp(np.ravel(myY_test[fraction]))-1))

                train_error[input][name].append(metrics.mean_absolute_error(np.exp(train_result)-1,np.exp(np.ravel(myY_train[fraction]))-1))
                train_errorc[input][name].append(metrics.mean_absolute_error(np.exp(train_resultc)-1,np.exp(np.ravel(myYc_train[fraction]))-1))
                train_errorr[input][name].append(metrics.mean_absolute_error(np.exp(train_resultr)-1,np.exp(np.ravel(myYr_train[fraction]))-1))
                train_errorsum[input][name].append(metrics.mean_absolute_error(np.exp(train_resultc)+np.exp(train_resultr)-2, np.exp(np.ravel(myY_train[fraction]))-1))

                score[input][name].append(themodel.score(myX_test[input][fraction].as_matrix(), np.ravel(myY_test[fraction])))
                scorec[input][name].append(themodelc.score(myX_test[input][fraction].as_matrix(), np.ravel(myYc_test[fraction])))
                scorer[input][name].append(themodelr.score(myX_test[input][fraction].as_matrix(), np.ravel(myYr_test[fraction])))

            else:
                    
                error[input][name].append(np.nan)
                errorc[input][name].append(np.nan)
                errorr[input][name].append(np.nan)
                errorsum[input][name].append(np.nan)

                train_error[input][name].append(np.nan)
                train_errorc[input][name].append(np.nan)
                train_errorr[input][name].append(np.nan)
                train_errorsum[input][name].append(np.nan)

                score[input][name].append(np.nan)
                scorec[input][name].append(np.nan)
                scorer[input][name].append(np.nan)

                
            done_steps+=1

            # update progress 

            sys.stdout.write("\r %i %% done" % (100.*done_steps/total_steps))
            sys.stdout.flush()
                
                
 100 % done

Inspection of results

The following cell creates some plots and tables to understand the performance of the fitted models. For each model, the same set of plots are created:

  • For linear regression models, the coefficients are printed, to provide some guidance concerning feature ranking
  • Only for the 10% training fraction, residual plots show the difference (prediction-true value) versus the true value. This is shown for the three different targets (total number of users, registered users, casual users). The direct prediction for the total number of users is compared to the sum of the two separate predictions for casual and registered users.
  • For all studied training fractions, the mean absolute error is plotted as a function of the training fraction. Again, the performance for the total number of users is compared to the one one would get when adding the two separate predictions for casual and registered users.

to top

In [33]:
# first loop to print a summary table of performance at target training percentage

summaryTable=pd.DataFrame(columns=models.keys(),index=inputs)


for name in models.keys():

    for input in inputs:
        
       summaryTable.loc[input,name]= error[input][name][9]
        

summaryTable
Out[33]:
Ridge Linear Regression BDT KNeighborsRegressor MLPRegressor SVR Lasso
perc 105.555 106.821 39.198 74.7995 123.713 126.063 127.717
int 106.803 108.113 39.4854 74.0333 101.661 121.505 127.717
plain 108.097 107.973 41.2652 72.0764 95.331 55.812 130.39
cat 106.471 106.422 40.5148 67.757 80.3455 58.6103 130.39
bdt 105.245 105.851 39.5334 73.7397 117.588 130.535 127.717
pca 109.995 109.995 77.8955 74.0535 165.123 136.88 128.352
In [25]:
for name in models.keys():

    for input in inputs:
            
        # inspect feature ranking

        if name=='Linear Regression' or name=='Ridge':

        
            lincoeff=pd.DataFrame(zip(myX_test[input][0.1].columns, target_model[input][name].coef_,target_modelc[input][name].coef_,target_modelr[input][name].coef_),columns=['feature','coefficient (total)',"coefficient (registered)", "coefficient (casual)"])
            print '\n Coefficients for', name,'with input',input,'\n'
            display(lincoeff)
            
        figres=plt.figure(figsize=(10,10))
        figres.canvas.set_window_title(name)
        figres.suptitle("Summary plots: "+name+" inputs "+input,fontsize=20)
        plt.subplots_adjust(hspace = .3)
    
        subres_0=plt.subplot2grid((3,2),(0,0))
        subres_0.scatter(fractions_test,error[input][name],label="total users",marker='s',s=40,c='b')
        subres_0.scatter(fractions_test,errorsum[input][name],label="r+c users",marker='o',s=20,c='r')
        subres_0.set_ylabel('Mean absolute error')
        subres_0.set_xlabel('Training fraction')
        plt.legend(loc='best');

        subres_0_1=plt.subplot2grid((3,2),(0,1))
        subres_0_1.scatter(fractions_test,train_error[input][name],label="total users, training",marker='s',facecolor='none',s=40,c='b')
        subres_0_1.scatter(fractions_test,errorsum[input][name],label="r+c users, training",marker='o',facecolor='none',s=20,c='r')
        subres_0_1.set_ylabel('Mean absolute error')
        subres_0_1.set_xlabel('Training fraction')
        plt.legend(loc='best');

        # use same scale for training and test errors
        
        ylim1=subres_0.get_ylim()
        ylim2=subres_0_1.get_ylim()
        ylimbest=min(ylim1[0],ylim2[0]),max(ylim1[1],ylim2[1])
        subres_0.set_ylim(ylimbest)
        subres_0_1.set_ylim(ylimbest)

        subres_1=plt.subplot2grid((3,2),(1,0),rowspan=2)
        subres_1.hist2d(np.exp(np.ravel(myY_test[0.1]))-1,np.exp(target_prediction[input][name])-np.exp(np.ravel(myY_test[0.1])),bins=(50,50),range=((0,500),(-500,500)),cmap=plt.cm.jet)
        subres_1.set_xlabel('True value (total users)')
        subres_1.set_ylabel('Residual')

        subres_2=plt.subplot2grid((3,2),(1,1))
        subres_2.hist2d(np.exp(np.ravel(myYc_test[0.1]))-1,np.exp(target_predictionc[input][name])-np.exp(np.ravel(myYc_test[0.1])),bins=(50,50),range=((0,200),(-200,200)),cmap=plt.cm.jet)
        subres_2.set_xlabel('True value (casual users)')
        subres_2.set_ylabel('Residual')
    
        subres_3=plt.subplot2grid((3,2),(2,1))
        subres_3.hist2d(np.ravel(np.exp(myYr_test[0.1]))-1,np.exp(target_predictionr[input][name])-np.exp(np.ravel(myYr_test[0.1])),bins=(50,50),range=((0,500),(-500,500)),cmap=plt.cm.jet)
        subres_3.set_xlabel('True value (registered users)')
        subres_3.set_ylabel('Residual')
    


                    
 Coefficients for Ridge with input perc 

feature coefficient (total) coefficient (registered) coefficient (casual)
0 0 -0.100119 -0.209888 -0.068678
1 1 0.030915 -0.001956 0.033523
2 2 -0.178589 -0.202031 -0.149185
3 3 0.707613 1.052447 0.598874
4 4 0.256710 0.719750 0.148784
5 5 -0.385083 -0.679291 -0.279940
6 6 0.512752 0.522287 0.494625
7 7 -0.019749 0.124205 -0.040569
8 8 0.019749 -0.124205 0.040569
9 9 -0.081450 -0.551571 0.041814
10 10 0.010920 -0.187092 0.030846
11 11 -0.703667 -0.761732 -0.606690
12 12 -0.099729 0.061586 -0.137278
13 13 0.099729 -0.061586 0.137278
14 14 0.000383 0.003758 -0.000423
15 15 -0.233016 -0.232128 -0.220820
16 16 0.324456 0.306763 0.293780
17 17 -0.135930 -0.162969 -0.129551
18 18 -0.048000 -0.080361 -0.025651
19 19 0.001407 0.003365 0.000842
20 20 0.023785 -0.009368 0.027758
21 21 0.001380 0.055399 -0.005813
22 22 -0.012899 -0.086753 0.015450
23 23 0.187123 0.363732 0.133308
24 24 0.222038 0.448818 0.163169
25 25 0.260674 0.405465 0.203221
26 26 -0.041966 -0.136034 -0.018047
27 27 -0.058153 -0.073854 -0.050630
28 28 -0.003029 -0.000646 -0.005195
29 29 0.023342 -0.027009 0.040723
... ... ... ... ...
132 132 0.125776 0.086669 0.136014
133 133 0.007179 -0.095520 0.011698
134 134 -0.044466 -0.084417 -0.037849
135 135 -0.075635 -0.030410 -0.094142
136 136 0.018713 0.023104 0.022603
137 137 -0.179127 -0.132368 -0.201991
138 138 -0.585813 -0.664421 -0.535324
139 139 -0.078508 -0.228478 -0.015687
140 140 -0.371211 -0.404514 -0.326146
141 141 -0.107169 0.021257 -0.128403
142 142 0.087421 0.102947 0.087834
143 143 0.019749 -0.124205 0.040569
144 144 -0.002943 -0.323093 0.057502
145 145 0.382131 0.217422 0.356992
146 146 0.045851 -0.398709 0.069662
147 147 0.007440 0.040329 -0.008876
148 148 0.012308 -0.164534 0.049445
149 149 0.001296 -0.290398 0.083600
150 150 0.114348 -0.704901 0.314499
151 151 0.399101 0.511141 0.348738
152 152 0.080769 0.273626 -0.001443
153 153 0.252295 0.266293 0.169721
154 154 -0.082746 -0.261173 -0.041785
155 155 -0.180498 -0.212040 -0.135835
156 156 -0.076065 -0.390376 0.003933
157 157 -0.005386 -0.161195 0.037881
158 158 0.518871 0.820518 0.400827
159 159 0.255327 0.679877 0.133204
160 160 -0.107405 -0.229677 -0.079501
161 161 -0.042807 -0.350592 0.045694

162 rows × 4 columns

 Coefficients for Ridge with input int 

feature coefficient (total) coefficient (registered) coefficient (casual)
0 mnth 0.030035 -0.013995 0.039652
1 hr 0.022023 -0.003563 0.024622
2 weathersit -0.051276 -0.055779 -0.041657
3 temp 0.584308 0.964324 0.468779
4 atemp 0.198047 0.637526 0.087938
5 hum -0.317524 -0.580379 -0.245395
6 windspeed 0.406413 0.487545 0.359560
7 day -0.004092 -0.007293 -0.003287
8 weekday_0 0.102508 0.169925 0.088278
9 weekday_1 0.191657 0.098018 0.207730
10 weekday_2 0.242515 0.168326 0.214483
11 weekday_3 -0.078757 -0.104147 -0.052747
12 weekday_4 -0.463589 -0.329888 -0.474817
13 weekday_5 0.092047 0.056283 0.106491
14 weekday_6 -0.086381 -0.058518 -0.089418
15 workingday_0 -0.064595 0.026667 -0.065896
16 workingday_1 0.064595 -0.026667 0.065896
17 holiday_0 0.080722 0.084739 0.064756
18 holiday_1 -0.080722 -0.084739 -0.064756
19 season_1 -0.053479 -0.332784 0.009016
20 season_2 0.293576 0.510852 0.214547
21 season_3 0.173037 0.089499 0.180578
22 season_4 -0.413134 -0.267567 -0.404140
23 yr_0 -0.036483 0.057837 -0.056822
24 yr_1 0.036483 -0.057837 0.056822
25 x0 x1 0.000085 0.003604 -0.000684
26 x0 x2 -0.023322 -0.002844 -0.029227
27 x0 x3 -0.276386 -0.265284 -0.258240
28 x0 x4 0.334338 0.307497 0.300997
29 x0 x5 -0.171982 -0.175323 -0.167994
... ... ... ... ...
295 x16 x23 0.035735 0.054402 0.026012
296 x16 x24 0.028859 -0.081069 0.039885
297 x17 x18 0.000000 0.000000 0.000000
298 x17 x19 0.066522 -0.113077 0.118236
299 x17 x20 0.060246 0.423755 -0.038643
300 x17 x21 0.077098 -0.271082 0.132601
301 x17 x22 -0.123145 0.045144 -0.147438
302 x17 x23 0.076305 0.204268 0.044074
303 x17 x24 0.004417 -0.119528 0.020682
304 x18 x19 -0.120001 -0.219707 -0.109221
305 x18 x20 0.233330 0.087097 0.253190
306 x18 x21 0.095939 0.360581 0.047977
307 x18 x22 -0.289989 -0.312710 -0.256702
308 x18 x23 -0.112788 -0.146430 -0.100896
309 x18 x24 0.032066 0.061691 0.036140
310 x19 x20 0.000000 0.000000 0.000000
311 x19 x21 0.000000 0.000000 0.000000
312 x19 x22 0.000000 0.000000 0.000000
313 x19 x23 -0.145230 -0.316589 -0.103052
314 x19 x24 0.091751 -0.016195 0.112067
315 x20 x21 0.000000 0.000000 0.000000
316 x20 x22 0.000000 0.000000 0.000000
317 x20 x23 0.196873 0.254129 0.162480
318 x20 x24 0.096703 0.256723 0.052067
319 x21 x22 0.000000 0.000000 0.000000
320 x21 x23 0.103938 0.104072 0.103569
321 x21 x24 0.069098 -0.014573 0.077009
322 x22 x23 -0.192065 0.016225 -0.219819
323 x22 x24 -0.221068 -0.283792 -0.184321
324 x23 x24 0.000000 0.000000 0.000000

325 rows × 4 columns

 Coefficients for Ridge with input plain 

feature coefficient (total) coefficient (registered) coefficient (casual)
0 season 0.151968 0.082595 0.158561
1 yr 0.350029 0.173477 0.380470
2 mnth -0.001747 0.007324 -0.001424
3 hr 0.095383 0.076017 0.095903
4 holiday -0.054306 -0.079973 -0.054669
5 weekday 0.020405 0.024558 0.019608
6 workingday 0.044523 -0.589933 0.180677
7 weathersit -0.045953 -0.027760 -0.041410
8 temp 0.849571 1.781900 0.642259
9 atemp 1.319198 2.223939 1.144477
10 hum -1.314615 -1.834025 -1.191393
11 windspeed 0.480496 0.350614 0.510943
12 day 0.002674 0.001888 0.002695
 Coefficients for Ridge with input cat 

feature coefficient (total) coefficient (registered) coefficient (casual)
0 mnth -0.000058 0.017167 -0.001418
1 hr 0.094030 0.074420 0.094682
2 weathersit -0.057104 -0.046552 -0.050385
3 temp 1.373169 2.522577 1.091746
4 atemp 1.238445 1.934757 1.109188
5 hum -1.289186 -1.853117 -1.158840
6 windspeed 0.399297 0.171518 0.453767
7 day 0.001945 0.000797 0.002076
8 weekday_0 -0.036200 0.037228 -0.052662
9 weekday_1 0.028816 -0.005757 0.029533
10 weekday_2 -0.050799 -0.039136 -0.051120
11 weekday_3 -0.093295 -0.169669 -0.068402
12 weekday_4 -0.075072 -0.171002 -0.051576
13 weekday_5 0.172884 0.178557 0.164398
14 weekday_6 0.053666 0.169778 0.029830
15 workingday_0 -0.028063 0.221549 -0.081710
16 workingday_1 0.028063 -0.221549 0.081710
17 holiday_0 0.045529 -0.014543 0.058878
18 holiday_1 -0.045529 0.014543 -0.058878
19 season_1 -0.156420 -0.094504 -0.162399
20 season_2 0.002162 0.214807 -0.040456
21 season_3 -0.152559 -0.312202 -0.111113
22 season_4 0.306817 0.191898 0.313968
23 yr_0 -0.177108 -0.091479 -0.191605
24 yr_1 0.177108 0.091479 0.191605
 Coefficients for Ridge with input bdt 

feature coefficient (total) coefficient (registered) coefficient (casual)
0 0 0.043231 0.012459 0.046100
1 1 0.030484 -0.006419 0.033058
2 2 -0.038456 -0.061050 -0.012138
3 3 0.672323 0.968866 0.566321
4 4 0.260028 0.686733 0.148254
5 5 -0.235562 -0.550791 -0.147445
6 6 0.473698 0.553270 0.439732
7 7 -0.003729 -0.008243 -0.003216
8 8 -0.014776 0.066313 -0.008832
9 9 0.014776 -0.066313 0.008832
10 10 -0.009133 -1.003527 0.255526
11 11 0.000405 0.003748 -0.000383
12 12 -0.013379 -0.001090 -0.016660
13 13 -0.348319 -0.358064 -0.317408
14 14 0.310733 0.246913 0.285946
15 15 -0.203130 -0.186966 -0.198661
16 16 -0.047897 -0.075029 -0.024359
17 17 0.002194 0.004372 0.001876
18 18 -0.013545 -0.022977 -0.009319
19 19 -0.000880 0.021024 -0.001571
20 20 0.018092 0.022352 0.014118
21 21 0.013490 -0.009212 0.015239
22 22 0.011986 0.002308 0.011223
23 23 0.008637 -0.002428 0.008206
24 24 0.005453 0.001391 0.008203
25 25 0.011404 -0.003239 0.012643
26 26 0.031827 0.015697 0.033456
27 27 0.023735 -0.005889 0.032340
28 28 0.019497 0.018347 0.013759
29 29 -0.004613 0.023061 -0.010128
... ... ... ... ...
133 133 0.216122 0.110315 0.234223
134 134 0.257576 0.442955 0.205509
135 135 0.002987 0.001533 0.004204
136 136 -0.003165 -0.016028 -0.001518
137 137 0.004995 0.005851 0.004290
138 138 -0.003006 0.003704 -0.003906
139 139 -0.004483 0.000857 -0.005893
140 140 -0.004550 -0.001532 -0.004861
141 141 0.003493 -0.002627 0.004468
142 142 -0.007732 -0.007475 -0.008259
143 143 0.004003 -0.000768 0.005042
144 144 0.010483 -0.001863 0.013715
145 145 -0.014212 -0.006380 -0.016931
146 146 -0.001148 0.012179 -0.003755
147 147 -0.000219 0.000824 -0.000130
148 148 0.009854 -0.001622 0.012782
149 149 -0.012216 -0.019624 -0.012113
150 150 -0.002393 -0.004743 -0.001952
151 151 -0.001336 -0.003500 -0.001265
152 152 0.433689 0.629477 0.312293
153 153 -0.075445 0.160265 -0.113910
154 154 0.060669 -0.093952 0.105078
155 155 0.014776 -0.066313 0.008832
156 156 0.019205 0.040062 -0.018111
157 157 -0.068477 0.091866 -0.103919
158 158 0.083253 -0.158179 0.112751
159 159 -0.101163 -0.374071 -0.029452
160 160 -0.252823 -0.450371 -0.193074
161 161 0.054556 -0.119675 0.083053
162 162 0.100373 -0.035966 0.145144

163 rows × 4 columns

 Coefficients for Ridge with input pca 

feature coefficient (total) coefficient (registered) coefficient (casual)
0 0 0.003197 0.003005 0.003144
1 1 -0.001918 -0.001433 -0.001912
2 2 0.009137 0.009222 0.008949
3 3 -0.002333 0.013294 -0.005826
4 4 -0.015796 -0.012722 -0.016216
5 5 -0.010757 -0.013596 -0.010124
6 6 0.016959 0.037543 0.013115
7 7 0.007706 0.018717 0.005365
8 8 -0.022458 -0.024637 -0.021066
9 9 -0.001319 -0.000997 -0.001768
10 10 0.009673 0.013534 0.008524
11 11 0.005235 0.008831 0.004239
12 12 -0.001879 -0.001818 -0.001761
13 13 -0.002140 -0.005893 -0.001418
14 14 0.005123 0.008170 0.003977
15 15 -0.034199 -0.024148 -0.034594
16 16 -0.006498 0.005883 -0.010064
17 17 -0.017995 -0.022353 -0.016726
18 18 0.014386 0.017356 0.012427
19 19 -0.020495 -0.045676 -0.016657
20 20 0.047502 0.066781 0.043893
21 21 -0.024798 -0.050369 -0.020001
22 22 0.064329 0.088526 0.057406
23 23 0.027173 0.032099 0.025271
24 24 0.010860 0.023772 0.008433
25 25 -0.007707 -0.011754 -0.007917
26 26 0.007203 -0.003585 0.009930
27 27 -0.012921 -0.020175 -0.011049
28 28 -0.004874 -0.001068 -0.004746
29 29 -0.003534 -0.017035 -0.000604
30 30 0.002652 0.011179 0.002749
31 31 -0.004658 -0.052782 0.004052
32 32 -0.002717 0.002031 -0.002705
33 33 0.005490 0.004063 0.006082
34 34 -0.040517 -0.038988 -0.036726
35 35 0.027773 0.042888 0.022945
36 36 0.029958 0.040483 0.030746
37 37 0.012808 -0.026537 0.021499
38 38 -0.004062 -0.013368 -0.002560
39 39 0.019878 0.000408 0.024289
40 40 0.015597 0.003602 0.016249
41 41 -0.018193 -0.020955 -0.015667
42 42 -0.021543 -0.048492 -0.014650
43 43 0.039292 0.039511 0.036460
44 44 0.026137 0.031995 0.019629
45 45 -0.027744 -0.057226 -0.029342
46 46 -0.013762 0.019761 -0.010875
47 47 0.005064 -0.027849 0.006597
48 48 -0.190630 -0.202282 -0.183236
49 49 0.071080 0.137504 0.053431
 Coefficients for Linear Regression with input perc 

feature coefficient (total) coefficient (registered) coefficient (casual)
0 0 -0.095650 -0.225096 -0.055106
1 1 0.026769 -0.009754 0.029954
2 2 -0.265884 -0.239441 -0.249740
3 3 14.698081 9.437119 14.917076
4 4 -10.022834 -3.653646 -10.729814
5 5 0.111580 -0.472254 0.255159
6 6 0.828924 1.398903 0.732782
7 7 -0.078644 0.094934 -0.099252
8 8 0.078644 -0.094934 0.099252
9 9 0.356190 -0.145048 0.464185
10 10 -0.210414 -0.557967 -0.169370
11 11 -1.104296 -1.101313 -1.026649
12 12 -0.149438 0.069333 -0.230892
13 13 0.149438 -0.069333 0.230892
14 14 0.000676 0.004299 -0.000140
15 15 -1.796887 -1.575525 -1.747440
16 16 1.926023 1.630547 1.852857
17 17 -0.141875 -0.145984 -0.140279
18 18 0.087826 0.021027 0.117374
19 19 0.001506 0.003739 0.000869
20 20 0.031665 -0.007068 0.036560
21 21 -0.000781 0.055361 -0.008034
22 22 -0.062283 -0.122232 -0.034443
23 23 0.183170 0.372268 0.121715
24 24 0.279047 0.550052 0.208037
25 25 0.342789 0.507344 0.271438
26 26 -0.045255 -0.150519 -0.016041
27 27 -0.050395 -0.074577 -0.039065
28 28 -0.002459 -0.000761 -0.004594
29 29 0.036951 0.019404 0.039306
... ... ... ... ...
132 132 0.185726 0.058200 0.211433
133 133 -0.465688 -0.412680 -0.473385
134 134 -0.103831 -0.114959 -0.097486
135 135 -0.589424 -0.392010 -0.615267
136 136 -0.007269 0.045117 0.005301
137 137 0.041935 -0.012844 0.006667
138 138 -0.689488 -0.880624 -0.590478
139 139 0.137507 -0.012020 0.181667
140 140 -0.501869 -0.605175 -0.444807
141 141 -0.108744 0.016857 -0.133812
142 142 0.030099 0.078077 0.034559
143 143 0.078644 -0.094934 0.099252
144 144 0.218683 -0.133028 0.282518
145 145 0.291455 0.047208 0.275437
146 146 -0.010074 -0.447146 0.007980
147 147 -0.040695 0.052476 -0.097080
148 148 0.119339 -0.147409 0.196332
149 149 0.163615 -0.077689 0.272629
150 150 0.182026 -1.326527 0.512583
151 151 0.642771 0.646322 0.667982
152 152 0.576809 0.719677 0.386488
153 153 0.980524 0.989585 0.644593
154 154 0.192575 -0.067359 0.191555
155 155 -0.726247 -0.650345 -0.617379
156 156 0.201503 -0.129164 0.270606
157 157 0.154687 -0.015884 0.193578
158 158 0.644279 1.032993 0.524550
159 159 0.314240 0.771336 0.207284
160 160 -0.177847 -0.350720 -0.131914
161 161 -0.141642 -0.474065 -0.038057

162 rows × 4 columns

 Coefficients for Linear Regression with input int 

feature coefficient (total) coefficient (registered) coefficient (casual)
0 mnth 0.018515 -0.029839 0.030097
1 hr 0.019515 -0.011074 0.022964
2 weathersit -0.102991 -0.032419 -0.106239
3 temp 7.578086 7.117100 7.575847
4 atemp -6.419727 -4.073720 -6.985722
5 hum -0.102332 -0.672597 -0.028472
6 windspeed 0.043523 0.508037 -0.089558
7 day -0.003806 -0.005853 -0.002629
8 weekday_0 0.089409 0.171896 0.056197
9 weekday_1 0.359961 0.242022 0.388153
10 weekday_2 0.367867 0.293058 0.334064
11 weekday_3 -0.091809 -0.197749 -0.017435
12 weekday_4 -0.685237 -0.457461 -0.714822
13 weekday_5 0.038451 0.009751 0.047956
14 weekday_6 -0.078642 -0.061518 -0.094112
15 workingday_0 -0.057527 0.038635 -0.038577
16 workingday_1 0.057527 -0.038635 0.038577
17 holiday_0 0.068293 0.071743 0.000662
18 holiday_1 -0.068293 -0.071743 -0.000662
19 season_1 0.094443 -0.128744 0.099146
20 season_2 0.217018 0.419178 0.162656
21 season_3 0.088105 -0.178159 0.160196
22 season_4 -0.399566 -0.112275 -0.421999
23 yr_0 -0.004969 0.098930 -0.029029
24 yr_1 0.004969 -0.098930 0.029029
25 x0 x1 0.000285 0.004095 -0.000449
26 x0 x2 -0.021379 -0.008800 -0.026399
27 x0 x3 -1.985444 -1.703216 -1.963471
28 x0 x4 2.093743 1.746312 2.055773
29 x0 x5 -0.172745 -0.127891 -0.173695
... ... ... ... ...
295 x16 x23 0.047563 0.070797 0.026137
296 x16 x24 0.009964 -0.109433 0.012441
297 x17 x18 0.000000 0.000000 0.000000
298 x17 x19 0.378665 0.245715 0.462491
299 x17 x20 0.002152 0.452729 -0.157856
300 x17 x21 -0.162049 -0.711367 -0.127244
301 x17 x22 -0.150475 0.084667 -0.176729
302 x17 x23 0.103606 0.240564 0.044450
303 x17 x24 -0.035312 -0.168821 -0.043788
304 x18 x19 -0.284222 -0.374458 -0.363345
305 x18 x20 0.214866 -0.033551 0.320512
306 x18 x21 0.250154 0.533208 0.287441
307 x18 x22 -0.249092 -0.196941 -0.245270
308 x18 x23 -0.108575 -0.141634 -0.073479
309 x18 x24 0.040281 0.069891 0.072817
310 x19 x20 0.000000 0.000000 0.000000
311 x19 x21 0.000000 0.000000 0.000000
312 x19 x22 0.000000 0.000000 0.000000
313 x19 x23 -0.068703 -0.224608 -0.054455
314 x19 x24 0.163146 0.095864 0.153602
315 x20 x21 0.000000 0.000000 0.000000
316 x20 x22 0.000000 0.000000 0.000000
317 x20 x23 0.156373 0.217033 0.131539
318 x20 x24 0.060645 0.202144 0.031118
319 x21 x22 0.000000 0.000000 0.000000
320 x21 x23 0.067292 -0.015393 0.096896
321 x21 x24 0.020813 -0.162766 0.063300
322 x22 x23 -0.159930 0.121897 -0.203008
323 x22 x24 -0.239636 -0.234172 -0.218991
324 x23 x24 0.000000 0.000000 0.000000

325 rows × 4 columns

 Coefficients for Linear Regression with input plain 

feature coefficient (total) coefficient (registered) coefficient (casual)
0 season 0.150707 0.079856 0.157586
1 yr 0.349915 0.172175 0.380623
2 mnth -0.001425 0.007909 -0.001154
3 hr 0.095006 0.075494 0.095554
4 holiday -0.054550 -0.081657 -0.054698
5 weekday 0.020467 0.024613 0.019678
6 workingday 0.044213 -0.592452 0.180829
7 weathersit -0.042328 -0.021952 -0.038256
8 temp 0.426985 1.314140 0.211045
9 atemp 1.816825 2.791372 1.647936
10 hum -1.344822 -1.877840 -1.218639
11 windspeed 0.519172 0.381934 0.552611
12 day 0.002714 0.001918 0.002738
 Coefficients for Linear Regression with input cat 

feature coefficient (total) coefficient (registered) coefficient (casual)
0 mnth -0.000227 0.017235 -0.001621
1 hr 0.093637 0.073851 0.094326
2 weathersit -0.054148 -0.042072 -0.047710
3 temp 1.391939 2.842541 1.005989
4 atemp 1.279529 1.693826 1.253732
5 hum -1.310068 -1.881875 -1.178423
6 windspeed 0.406970 0.143780 0.472027
7 day 0.001927 0.000730 0.002073
8 weekday_0 -0.035749 0.037581 -0.052180
9 weekday_1 0.029180 -0.004791 0.029648
10 weekday_2 -0.051026 -0.038622 -0.051538
11 weekday_3 -0.094036 -0.170386 -0.069148
12 weekday_4 -0.076410 -0.173562 -0.052584
13 weekday_5 0.174026 0.179224 0.165769
14 weekday_6 0.054014 0.170556 0.030032
15 workingday_0 -0.027841 0.222356 -0.081611
16 workingday_1 0.027841 -0.222356 0.081611
17 holiday_0 0.046106 -0.014219 0.059463
18 holiday_1 -0.046106 0.014219 -0.059463
19 season_1 -0.147468 -0.074867 -0.155995
20 season_2 -0.001560 0.210408 -0.044052
21 season_3 -0.163749 -0.337271 -0.118610
22 season_4 0.312778 0.201730 0.318657
23 yr_0 -0.176464 -0.089940 -0.191182
24 yr_1 0.176464 0.089940 0.191182
 Coefficients for Linear Regression with input bdt 

feature coefficient (total) coefficient (registered) coefficient (casual)
0 0 0.054434 0.014641 0.059190
1 1 0.025706 -0.017365 0.029400
2 2 -0.073659 0.113199 -0.093652
3 3 7.064030 5.769851 7.389642
4 4 -5.962329 -2.872189 -6.501972
5 5 0.136220 -0.631017 0.302223
6 6 0.286253 0.488061 0.288880
7 7 -0.004330 -0.009651 -0.003728
8 8 -0.280037 -0.159569 -0.263185
9 9 0.280037 0.159569 0.263185
10 10 0.905807 -0.395751 1.203640
11 11 0.000744 0.004138 -0.000032
12 12 -0.014992 -0.009480 -0.017723
13 13 -1.776980 -1.237093 -1.783297
14 14 1.728584 1.096087 1.736263
15 15 -0.202865 -0.137054 -0.205709
16 16 0.078711 -0.010028 0.118160
17 17 0.002631 0.005006 0.002237
18 18 -0.020664 -0.030241 -0.016106
19 19 -0.004087 0.015980 -0.004590
20 20 0.017294 0.020084 0.013922
21 21 0.016457 -0.008428 0.018490
22 22 0.021299 0.011999 0.020446
23 23 0.014984 0.001291 0.014877
24 24 0.009151 0.003955 0.012152
25 25 0.020521 -0.000111 0.022968
26 26 0.033912 0.014752 0.036223
27 27 0.022399 -0.011533 0.032268
28 28 0.032034 0.026174 0.026922
29 29 -0.066848 -0.025232 -0.073502
... ... ... ... ...
133 133 0.058885 0.032115 0.094213
134 134 0.227368 0.455946 0.194667
135 135 0.002151 0.001014 0.003463
136 136 -0.003544 -0.017192 -0.001722
137 137 0.005518 0.006113 0.004758
138 138 -0.002160 0.004223 -0.003257
139 139 -0.005984 0.001098 -0.007748
140 140 -0.002698 -0.000313 -0.003062
141 141 0.002387 -0.004595 0.003840
142 142 -0.006657 -0.006422 -0.007414
143 143 0.002327 -0.003228 0.003686
144 144 0.006864 -0.006809 0.010988
145 145 -0.011195 -0.002842 -0.014716
146 146 -0.002562 0.012449 -0.005540
147 147 0.002404 0.002741 0.002429
148 148 0.010611 -0.002215 0.013795
149 149 -0.014783 -0.022626 -0.014412
150 150 -0.003168 -0.005588 -0.002706
151 151 -0.001162 -0.004063 -0.001021
152 152 1.514437 1.508163 1.366869
153 153 -0.228875 0.078417 -0.271018
154 154 -0.051162 -0.237987 0.007832
155 155 0.280037 0.159569 0.263185
156 156 0.016400 0.062366 -0.026189
157 157 0.030739 0.232690 -0.019980
158 158 0.249298 -0.073121 0.283165
159 159 -0.250264 -0.513159 -0.171859
160 160 -0.254824 -0.508043 -0.181223
161 161 0.033571 -0.159566 0.067404
162 162 0.105516 -0.013567 0.147111

163 rows × 4 columns

 Coefficients for Linear Regression with input pca 

feature coefficient (total) coefficient (registered) coefficient (casual)
0 0 0.003197 0.003005 0.003144
1 1 -0.001918 -0.001432 -0.001912
2 2 0.009137 0.009222 0.008949
3 3 -0.002332 0.013294 -0.005825
4 4 -0.015796 -0.012722 -0.016216
5 5 -0.010756 -0.013596 -0.010124
6 6 0.016958 0.037543 0.013115
7 7 0.007707 0.018718 0.005365
8 8 -0.022458 -0.024637 -0.021067
9 9 -0.001319 -0.000997 -0.001768
10 10 0.009672 0.013533 0.008523
11 11 0.005236 0.008832 0.004239
12 12 -0.001880 -0.001819 -0.001762
13 13 -0.002140 -0.005893 -0.001418
14 14 0.005124 0.008171 0.003977
15 15 -0.034199 -0.024148 -0.034594
16 16 -0.006496 0.005887 -0.010062
17 17 -0.017994 -0.022351 -0.016725
18 18 0.014386 0.017356 0.012427
19 19 -0.020497 -0.045679 -0.016659
20 20 0.047505 0.066785 0.043896
21 21 -0.024798 -0.050369 -0.020000
22 22 0.064332 0.088531 0.057409
23 23 0.027172 0.032098 0.025269
24 24 0.010860 0.023773 0.008434
25 25 -0.007707 -0.011755 -0.007918
26 26 0.007200 -0.003587 0.009928
27 27 -0.012919 -0.020172 -0.011047
28 28 -0.004875 -0.001071 -0.004747
29 29 -0.003531 -0.017033 -0.000601
30 30 0.002653 0.011180 0.002750
31 31 -0.004660 -0.052788 0.004051
32 32 -0.002716 0.002034 -0.002704
33 33 0.005491 0.004063 0.006084
34 34 -0.040521 -0.038992 -0.036730
35 35 0.027775 0.042891 0.022947
36 36 0.029968 0.040496 0.030756
37 37 0.012809 -0.026544 0.021502
38 38 -0.004063 -0.013371 -0.002561
39 39 0.019880 0.000406 0.024292
40 40 0.015598 0.003599 0.016250
41 41 -0.018198 -0.020960 -0.015670
42 42 -0.021553 -0.048512 -0.014658
43 43 0.039300 0.039516 0.036467
44 44 0.026152 0.032016 0.019640
45 45 -0.027767 -0.057267 -0.029366
46 46 -0.013766 0.019797 -0.010877
47 47 0.005065 -0.027896 0.006601
48 48 -0.190877 -0.202549 -0.183474
49 49 0.071189 0.137719 0.053513
/home/disimone/.local/lib/python2.7/site-packages/matplotlib/pyplot.py:524: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  max_open_warning, RuntimeWarning)

Comments

  • KNeighboursRegressor:
    • The mean absolute error shows a very regular behavior as a function of the training fraction, quite similar to a power law, as one would expect from a progressive increase of the density of neighbours (i.e. increased precision of the interpolation).
    • There does not appear to be any significant gain in predicting the total users starting from the two seaparate components (registered/casual).
    • While the mean absolute error for a 10% training fraction is around 85, the residual plot shows that the residuals tend to be significantly larger for larger number of total users. This is moslty due to the registered users (which is, on the other hand, the dominant component of the total users)
  • LinearRegression, Ridge
    • The two linear models perform in an overall comparable way. The absolute mean error is stable above a 10% training training fraction
    • No gain is observed by splitting the total users in the two components
    • In the residual plots it is here even more evident that for large numbers of users, the prediction is systematically lower than the true value
    • Inspection of the coefficients of the models reveals that:
      • The weathersit feature is only weakly correlated to the registered users, while it is anti-correlated to the casual users, which seems to suggest that registered users are those who bike because they have to, e.g. to commute to work. This seems to agree with what seen for other weather-related features (like atemp, hum, windspeed)
      • This is confirmed by the very strong correlation of season with the casual users, which is not true in the case of the registered users, and by the oppite-sign correlation of holiday with registered and casual users
  • SVR
    • This model, while not performing as good as the linear models or the KNeighboursRegressor, shows a significant improvement when predicting the registered and casual users separately, which may be worth a longer investigation, maybe trying different configurations of the algorithm
    • The residuals for the registered users are strongly correlated with the true value, mush more strongly than observed for the other models, which may be one of the reasons for the sub-optimal performance in terms of mean absolute error
  • MLPRegressor
    • This model still shows some improvements from the splitting of the total users in registered and casual users, and reaches performances comparable to the linear models, with no worse-looking residual plots

to top

In [ ]: